Abstract. A standard goal of model evaluation and selection is to find 
a model that approximates the truth well while at the same time is as 
parsimonious as possible. In this paper we emphasize the point of view 
that the models under consideration are almost always false, if viewed 
realistically, and so we should analyze model adequacy from that point 
of view. We investigate this issue in large samples by looking at a 
model credibility index, which is designed to serve as a one-number 
summary measure of model adequacy. We define the index to be the 
maximum sample size at which samples from the model and those from 
the true data generating mechanism are nearly indistinguishable. We 
use standard notions from hypothesis testing to make this definition 
precise. We use data subsampling to estimate the index. We show that 
the definition leads us to some new ways of viewing models as flawed 
but useful. The concept is an extension of the work of Davies [Statist. 
Neerlandica 49 (1995) 185-245]. 

Key words and phrases: Model selection, statistical distance, boot- 
strap, model credibility index, normality. 



1. INTRODUCTION 

Our starting point is the famous quotation of 
G. E. P. Box: 

All models are wrong, but some are useful 
(1976). 

In this article we will take as our initial premise 
that "All models are wrong," and see where it leads 
us. A consequence of model falseness is that for every 
data generating mechanism there exists a sample 
size at which the model failure will become obvious. 
Our second premise is that there are occasions 
when one will want to use, in some fashion, a model 
that is clearly false, provided that it provides a parsi- 
monious and powerful description of the generating 
mechanism. Here we wish to emphasize that we are 
interested in description, not prediction, as there is 
a smaller advantage to simplicity when the overar- 
ching goal is accurate prediction. 

In order to explore this question, the key assump- 
tion of this paper will be that the sample size un- 
der which the data is collected, say, n, is sufficiently 
large that many of the models under investigation 
are clearly false. This would seem to be a reason- 
able assumption in the modern data-mining envi- 
ronment. Just the same, we wish to measure the 
quality of their approximation to the true data gen- 
erating mechanism to see which ones most econom- 
ically capture its main features. Later in this paper 
we will use subsampling from the data as a means 
of replicating the true data generating mechanism. 

It is important to our theme that we are seeking 
to measure attributes that are completely unrelated 
to the value of n that generated the data at hand. 
We emphasize this because the standard tools for 
model assessment are highly n-dependent. For ex- 
ample, hypothesis testing has played a prominent 
role in the assessment of the models since the de- 
velopment of Pearson's chi-squared statistic. Unfor- 
tunately, it is based on the false premise that the 
model is correct, and so for a large enough sample 
size, we are doomed to reject any fixed model. That 
is, if we view these tests as answers to the question: 
"Is this model useful?," then what we mean by use- 
fulness is clearly related to not just the quality of 
the model, but also the size of the sample that was 
used in its assessment. So hypothesis testing does 
not meet our need directly. 

In our approach we use testing methodology but 
in an inverted fashion. We treat the null hypothesis 
as being false, and ask questions about the power of 
the test statistic as a function of its sample size. We 
define our new index, called the model credibility in- 
dex, as the sample size needed to obtain a desirable 
power. Although the point of view is not new that 
the power of a test depends on the sample size, it is 
a novel idea to propose the sample size as a model 
evaluation index. 

Other standard risk analyses, the basis for AIC, 
Mallows' Cp and other methods, are n-dependent 
because the goal there is to assess the quality of 
prediction using the fitted model. These criteria for 
model selection depend not just on the model itself, 
but also on the quality of the parameter estimation, 
which in turn depends on n. 

We hope that our new methods will be thought- 
provoking because they involve only standard tools 
of testing and risk assessment, so they could be read- 
ily understood (and constructed) by any statistician. 

Just the same, we think that our work presents a 
challenge to the standard statistical train of thought. 
Statisticians are quite accustomed to taking the 
"model true" point of view. After all, we have a 
huge box of statistical tools that are based on the 
assumption. This can make it hard for statisticians 
to maintain consistently a "model false, but maybe 
useful" point of view. 

For example, suppose we have a random sample 
$X_1, X_2, \ldots, X_n$ with distribution $\tau$. In traditional 
model building much is made of the idea of consistency, 
in the sense of finding the true distribution 
$\tau$ based on the assumption that it lies within some 
narrow set of models. However, this true distribution is 
very likely to be much too complex to be useful, 
especially if we consider the discretization, rounding, 
misrecording and measurement errors incumbent in 
real data. (For example, see the discussion of Ghosh 
and Samanta, 2001, page 1140.) For the duration of 
this article, at least, we ask the reader to believe in 
model-falseness, and further believe that usefulness 
is not necessarily tied to consistency. 

In the next subsection we give an informal intro- 
duction to our methodology. This will be followed by 
a more detailed look at the contents of the paper. 

1.1 Introducing Credibility Indices 

Davies (2002) gave the following definition: 

A probability model $P_\theta$ is an adequate 
approximation for the data set $(x_1, \ldots, x_n)$ 
if "typical" samples $(X_1(\theta), \ldots, X_n(\theta))$ of 
size n generated using $P_\theta$ "look like" the 
real data set $(x_1, \ldots, x_n)$. 

This is clearly an n-dependent assessment, but it 
captures what we consider an important aspect of a 
good model — that it is good at creating data similar 
to the observed data. 

To illustrate our thinking, let us start with the 
most prominent statistical assumption, that the data 
is normally distributed. Surely we might believe that 
no data is exactly normal in distribution, but that 
it is often useful and plausible to assume so. 

Berkson (1938) described the paradox that a good- 
ness-of-fit test may become embarrassingly powerful 
whenever the data are extensive: 

I believe that an observant statistician who 
has had any considerable experience with 
applying the chi-square test repeatedly will 
agree with my statement that, as a matter 
of observation, when the numbers in the 
data are quite large, the P's tend to come 
out small. Having observed this, and on 
reflection, I make the following dogmatic 
statement, referring for illustration to the 
normal curve: "If the normal curve is fit- 
ted to a body of data representing any 
real observations whatever of quantities 
in the physical world, then if the number 
of observations is extremely large — for in- 
stance, on the order of 200,000 — the chi- 
square P will be small beyond any usual 
limit of significance." 
If this be so, then we have something here 
that is apt to trouble the conscience of a 
reflective statistician using the chi-square 
test. For I suppose it would be agreed by 
statisticians that a large sample is always 
better than a small sample. If, then, we 
know in advance the P that will result 
from an application of a chi-square test to 
a large sample there would seem to be no 
use in doing it on a smaller one. But since 
the result of the former test is known, it 
is no test at all! 

As a response, Hodges and Lehmann (1954) suggested 
that the difficulty could be avoided by making 
a distinction between "statistical significance" and 
"practical significance" in the formulation of the 
problem. The idea was to construct a larger hypothesis 
$H_1$ of distributions about the null $H_0$, representing 
distributions that are close enough to $H_0$ so 
that the difference is deemed not practically significant 
with the data at hand. If one lets $H_1$ play the 
role of the null hypothesis, and the true distribution 
is an element of $H_1$, then one might still wish to 
use the model $H_0$. Liu and Lindsay (2009) expanded 
upon this idea, but still found difficulty in creating 
a reasonable set $H_1$ having a simple interpretation. 

Conducting a goodness-of-fit test involves two 
choices: the test and the significance level $\alpha$. Given 
an alternative, there is a resulting type II error $\beta$. 
We start our development by showing how one can 
invert goodness-of-fit testing to develop a new mea- 
sure of model failure. To help fix the idea, we use the 
following example. The full data set consists of the 
diastolic and systolic blood pressure data of 10,529 
persons aged from 35 to 84. We take only the 1239 
normal females as our data to be analyzed, because 
the blood pressures of the full sample would likely be 
better modeled as a mixture of normals. The original 
data was obtained from the Clinical Trials Research 
Unit (CTRU) of New Zealand. Central limit theory 
suggests that such data might be rather normal in 
distribution. After looking at the QQ plot in Figure 1, 
where there is little deviation from a straight line 
except in the tails, we think many statisticians would 
be happy using a normal model for such data. 

On the other hand, suppose we use the Kolmogorov- 
Smirnov goodness-of-fit test to test the normality 
assumption. The test statistic is the greatest ab- 
solute vertical distance between the empirical dis- 
tribution function of blood pressures and the hy- 
pothetical normal distribution function, evaluated 
on the 1239 sample values. The parameters of the 
normal distribution are estimated from the sample. 



Normality is strongly rejected (p-value = 0.0016), a 
fact which we might attribute to the large sample 
size (n = 1239). That is, at such a sample size, we 
have power against what appear to be very small 
deviations from normality. In this example, normality 
is rejected although the data look quite normal 
at the center. 

How can we say this data is very well described by 
a normal model without saying it is exactly normal? 
Here is one way to use statistical testing to answer 
the question. 

One starts with a goodness-of-fit test method that 
has desirable operating characteristics. That is, it 
should be sensitive to important model failures (al- 
ternatives) but insensitive to trivial model failures. 
We discuss this choice in the next subsection. 

Given a true probability generating mechanism $\tau$ 
that is not in the model, and a size-$\alpha$ test procedure 
$I\{T_m(X_1, \ldots, X_m) > c_m\}$, one can define the power 
curve $\beta_\tau(m) = P_\tau\{T_m(X_1, \ldots, X_m) > c_m\}$. See 
Figure 2 for such a plot based on the blood pressure 
data. Here $\tau$ is the empirical distribution of the full 
data set, and the test is the Kolmogorov-Smirnov test for 
normality with $\alpha = 0.05$. As a simple one-number 
summary of such a plot, we define the maximum credible 
sample size of the postulated model (here the normal 
model in the blood pressure population) to be 
that sample size $N^* = N^*(\tau, \mathcal{M})$ at which we would 
reject the model $\mathcal{M}$ 50% of the time based on a size-$\alpha$ 
($\alpha < 0.5$) goodness-of-fit test. We will also call N* the 
model credibility index. More generally, one could 
define $N_\beta$ as the sample size needed to attain power 
$\beta$, in which case the index N* is $N_{0.5}$. 

Fig. 1. QQ plot of the Blood Pressure data of 1239 females. 

Although one might choose other summaries of 
the power curve, such as $(N_{0.25}, N_{0.75})$, we find N* to 
be a natural summary. It also creates certain asymptotic 
simplifications. 

If the model is actually correct, then $N^* = \infty$. 
However, if the model is false, there is some finite 
sample size at which the power would reach 0.50. 
Different tests will have different power curves that 
in turn reveal different inadequacies of the model. 

In Figure 2 we assumed that the true distribution 
$\tau$ is random sampling from our set of 1239 scores, 
and we determined $\beta(m)$ by simulation. That is, we 
bootstrapped repeated samples of various hypothetical 
sizes m from the 1239 blood pressure values and 
repeatedly conducted the Kolmogorov-Smirnov test 
until we found the m that gave power 0.5. For example, 
we found that when m = 315, the normality 
assumption was rejected by the Kolmogorov-Smirnov 
test approximately 50% of the time (499/1000). 

Fig. 2. Plot of test power vs. sample size. 

Table 1 
m at various test sizes and power levels for blood pressure data 

                        Test size 
Power $\beta_\tau(m)$   $\alpha = 0.1$   $\alpha = 0.05$   $\alpha = 0.01$ 
0.3                     115              200               410 
0.5                     225              315               600 
0.7                     360              490               795 
0.9                     540              695               1050 

The choice of test size is also arbitrary. Table 1 
shows the estimated sample size m needed to obtain 
various powers $\beta_\tau(m)$ at different significance 
levels. The monotone pattern in the table indicates 
that one needs a larger sample size to obtain more 
power, and also to attain a given power at a smaller 
(more stringent) test size. 

Based on this analysis, it is clear that it would be 
very hard to detect non-normality in samples of size 
100 from this true distribution ($\beta(100) = 0.13$). To 
put this another way, the samples of size 100 must 
"look" very much like samples from a normal dis- 
tribution, and so one might say that normality is a 
good descriptor of the sampling mechanism at this 
sample size. Indeed, this descriptive power holds till 
the sample size approaches 315, when the distinc- 
tion between normal samples and data mechanism 
samples must start to become more obvious. 
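
The bootstrap determination of the power at a given m, as described above, might be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and because the text does not say how critical values were obtained when the normal parameters are estimated, the sketch assumes a Monte Carlo critical value simulated under normality.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_normal_stat(sample):
    """KS distance between the ECDF and a normal with fitted mean and sd."""
    xs = sorted(sample)
    m = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # Compare the fitted CDF with the ECDF just after and just before x.
        d = max(d, (i + 1) / m - f, f - i / m)
    return d


def null_critical_value(m, alpha, n_sim, rng):
    """Assumed approach: simulate the (1 - alpha) quantile under normality."""
    stats = sorted(
        ks_normal_stat([rng.gauss(0.0, 1.0) for _ in range(m)])
        for _ in range(n_sim)
    )
    return stats[min(n_sim - 1, int((1 - alpha) * n_sim))]


def bootstrap_power(data, m, alpha=0.05, n_boot=1000, n_sim=1000, seed=0):
    """Share of size-m bootstrap resamples in which normality is rejected."""
    rng = random.Random(seed)
    crit = null_critical_value(m, alpha, n_sim, rng)
    rejections = sum(
        ks_normal_stat(rng.choices(data, k=m)) > crit for _ in range(n_boot)
    )
    return rejections / n_boot
```

Evaluating `bootstrap_power` over a grid of m traces out an estimate of the power curve; the credibility index is then the m at which the estimate crosses 0.5.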

1.2 Role of Test Statistics 

What kind of index is N*, in a mathematical sense? 
As we will see later, in a detailed analysis of some 
standard test statistics, it is inversely proportional 
to the squared distance measure that was used to 
construct the test statistic. 

This makes it quite clear that the value of the 
model credibility index N* depends strongly on the 
test statistic that is being used. If we wish N* to 
reflect usefulness of the model, then the test statistic 
must be sensitive to those model failures which 
we consider most important. Thus, the choice of the 
test must reflect our statistical purposes, as well as 
which models we consider to be competitors. For 
example, if we would consider a t-distribution a useful 
alternative description, a test that is sensitive to 
tail probabilities, such as Anderson-Darling, would 
be desirable. 

The Kolmogorov-Smirnov test is a test of normality 
for large samples. One of its limitations is that it 
is more sensitive to deviations in the center than 
to deviations in the tails. In the blood pressure example, at 
least the center of data is quite normal (Figure 1). If 
one is interested in the tail regions, then one should 
use other tests that are more sensitive to tails. More 
generally, Claeskens and Hjort (2003) develop model 
selection tools which can focus on specific aspects of 
lack of fit. 

While trying out other data sets to use in this paper, 
we examined another data set with heights of 
2603 female adults from the data surveys and collection 
systems of the Centers for Disease Control and 
Prevention (NHANES, 1999-2000). The Kolmogorov-Smirnov 
test for normality of this set gave a p-value 
greater than 0.10. Although this data set did not meet 
Berkson's criterion of 200,000, it was even more normal 
than the blood pressure set. See Figure 3. We 
found another interesting feature of this heights data. 
The original data is coded in centimeters to one 
decimal place. However, when we rounded the 
data to integer values, the p-value of the Kolmogorov-Smirnov 
test became 0.000, leading to a rejection 
of normality. This illustrates that the Kolmogorov-Smirnov 
test is sensitive to data coding. 
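
The sensitivity to data coding can be illustrated on synthetic data. The sketch below does not use the NHANES heights: the mean of 162 cm and standard deviation of 6.5 cm are hypothetical values chosen only for illustration, and the function names are ours. It compares the Kolmogorov-Smirnov distance to a fitted normal before and after rounding to integers; rounding piles observations onto a few values, so the empirical distribution function takes large jumps that a continuous normal curve cannot follow.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_distance_to_fitted_normal(sample):
    """KS distance between the ECDF and a normal with fitted mean and sd."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def rounding_effect(n=2603, mean=162.0, sd=6.5, seed=0):
    """Return (distance at one-decimal coding, distance at integer coding)."""
    rng = random.Random(seed)
    # Synthetic normal "heights" recorded to one decimal place.
    heights = [round(rng.gauss(mean, sd), 1) for _ in range(n)]
    # The same values coarsened to whole centimeters.
    rounded = [float(round(h)) for h in heights]
    return (
        ks_distance_to_fitted_normal(heights),
        ks_distance_to_fitted_normal(rounded),
    )
```

Calling `rounding_effect()` returns the two distances; a larger distance for the integer-coded values corresponds to a smaller p-value at the same sample size.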

The Shapiro-Wilk W-statistic (1965) is a well-known 
goodness-of-fit test for the normal distribution. 
It is attractive because it has a simple, graphical 
interpretation: one can think of it as the correlation 
between given data and their corresponding 
normal scores. The Shapiro-Wilk test has good 
power properties across a wide range of alternative 
distributions in comparison with other goodness-of-fit 
tests (Shapiro, Wilk and Chen, 1968). 

For the blood pressure data, normality is also rejected 
by the Shapiro-Wilk W-statistic (p-value = 
0.0043). The credibility index is N* = 220 for the 
Shapiro-Wilk test. 

The chi-square test, introduced by Pearson in 1900, 
is the oldest and best known goodness-of-fit test. 
The idea is to reduce the goodness-of-fit problem to 
a multinomial setting by grouping data and com- 
paring cell counts. Chi-squared tests can be applied 
to any type of variable: continuous, discrete or a 
combination of these. However, grouping the data 
sacrifices information, especially if the underlying 
variable is continuous. For the blood pressure data, 
normality is rejected by the chi-squared test with 
p-value = 0.0000; and the credibility index is N* = 
240. 

In comparing these credibility indices, we recall 
that, even though N* has a natural sample size 
interpretation, it is $\sqrt{N^*}$ that is the more statistically 
meaningful quantity, as it reflects the standard 
deviation scale of uncertainty. (This in turn 
arises, mathematically, because N* is inversely 
proportional to the squared distance, making its root 
inversely proportional to the distance.) For these 
tests, the root indices were $\sqrt{315} = 17.75$, $\sqrt{220} = 
14.83$, and $\sqrt{240} = 15.49$, very similar values, albeit 
measures of different model fit features. 



How might one use the N*-index? Certainly in 
any particular data set N* = 315 has its own direct 
statistical interpretation. And one can use simulation 
methodology to obtain a better feel for the 
magnitude of N* = 315, as we do in Section 3.3. 
More generally, given a specific testing method and 
type of data set, one could use the N*-values to address 
the question as to which data set is a better fit 
to the model and quantify the differences. However, 
the greatest strength of this methodology is that it 
creates a universal tool that transcends particular 
data types and particular testing methods. That in 
turn raises questions as to whether it is possible to 
compare N*-values across different settings in a 
reasonable way. In particular, one might ask whether 
an N*-value is large or small given the number of 
parameters included in the model. This last question 
we defer to future research. 

1.3 Estimating N* 

To this point, we have treated N* as a population 
quantity, where the population in our example is a 
large data set. As such, there is only simulation error 
in our bootstrap estimation. Inference about N* 
when the large data set is itself treated as a sample 
of size n from a yet larger population, so that $\tau$ is 
unknown, creates some challenging inference problems. 
One can, as before, estimate the power curve $\beta_\tau(m)$ 
by averaging over bootstrap samples of size m, but 
now the estimator is not unbiased for $\beta_\tau(m)$ unless 
we use sampling without replacement, a method we 
will simply call subsampling (see Politis, Romano 
and Wolf, 1999). 

The subsampling framework gives us several tools 
to tackle inferential questions. In a later section we 
will show that we have consistent and asymptotically 
normal estimation of $\beta_\tau(m)$ when m is fixed 
and $n \to \infty$. However, in a more realistic scenario 
in which the sampling fraction $\phi = m/n$ is fixed as 
$n \to \infty$, the inverse ratio $\phi^{-1} = n/m$ is shown to be 
an important measure of the quality of N* inference. 
When $\phi^{-1}$ is small, say, 10 or less, then the 
estimator of $\beta_\tau(m)$ has considerable uncertainty. 

1.4 Our Contents 

We have now introduced a measure of the credibil- 
ity of a model which depends on the hypothesis test- 
ing methodology, but it comes with a new interpre- 
tation. Note that it is a characteristic of the model, 
the test statistic and the data generating mechanism, 
but not the de facto sample size n used to 
estimate it. It is a highly portable statistic, as one 
can use it in any context where there is a known 
goodness-of-fit procedure. However, it is also clear 
that it can only be estimated well when the de facto 
sample size is large enough to make the model in 
question clearly false. 

Fig. 3. QQ plot of the heights of 2603 female adults, both original and rounded data. 

In this paper we start by discussing how the work 
of Davies inspired our approach in Section 2, and 
reviewing briefly other related literature. We then 
formally define the model credibility index in Sec- 
tion 3. There we also expand upon the normal ex- 
ample so as to compare numerically two-sample and 
one-sample testing approaches and to compare boot- 
strapping and subsampling as methods to compute 
N*. 

In Section 4 we explore the asymptotic properties 
of the power estimators associated with the model 
credibility index. We then in Section 5 examine the 
structure of the model credibility index in greater 
detail in the context of likelihood ratio testing in 
categorical models. We will show how these indices 
are closely related to Kullback-Leibler discrepancy 
measures, and give some further numerical exam- 
ples. Section 6 concludes the paper and proposes 
topics worthy of further investigation. 



2. BACKGROUND 

In this section we will review some related work 
on the conceptual difficulty involved in using models 
while assuming they are false. 

2.1 Distance-Based Indices of Fit 

A more standard approach to model-false analysis 
would be to characterize model fitness by choosing 
a suitable distance measure, then doing inference on 
the distance between the true distribution and the 
model. 

In 1954 Hodges and Lehmann proposed using tolerance 
zones around the null hypothesis. They constructed 
$H_1$ as a set of distributions whose distance 
to $H_0$ does not exceed a specified bound c under a 
distance measure. Hodges and Lehmann's analysis 
was in the context of the chi-squared goodness-of- 
fit test. They used a weighted Euclidean distance as 
the distance from a model element to the truth. The 
usual chi-squared distance is included by choosing 
appropriate weights. 

Hodges and Lehmann didn't give a detailed dis- 
cussion on how one should choose c. They mentioned 
that the specification of c would "present problems 
similar to those encountered in choosing the alter- 
native at which specified power is to be obtained." 
This quoted statement presents some difficulties in 
its interpretation and implementation. 
Liu and Lindsay (2009) expanded on this tubular 
model idea, but used two different distances, like- 
lihood for the test statistic and Kullback-Leibler 
for the tube hypothesis. Their tubular model con- 
sisted of all multinomial distributions lying within a 
distance-based neighborhood of the parametric model 
of interest. The distance between the true multi- 
nomial distribution and the parametric model was 
used as the index of fit. Liu and Lindsay developed 
a likelihood ratio test (LRT) procedure for testing 
the magnitude of the index. 

Goutis and Robert (1998) proposed a Bayesian 
approach for the model selection problem based on 
the likelihood deviation between two nested models, 
called the full and restricted models. The full model 
space was considered to contain the true distribu- 
tion. The Bayesian approach was implemented by 
specifying a prior distribution in the full model, pos- 
sibly an improper prior. Each prior distribution was 
projected onto the restricted model space and the 
corresponding minimum distance measure was com- 
puted. Therefore, the posterior distribution of the 
distance from a prior distribution to the restricted 
model can be derived. Bayesian inference was made 
on the restricted model based on the posterior distri- 
bution. For example, one criterion was to reject the 
restricted model if the posterior probability that the 
distance was less than a certain bound c was small 
enough. Other aspects of the posterior distribution 
could be considered as the testing criteria. When 
one doesn't have a strong prior belief, several priors 
could be used to assess the distance between models. 
The sensitivity of the inference to the priors could 
be used as a factor in making the model choice. 

Dette and Munk (2003) used the Euclidean distance 
in the problem of testing for a parametric 
form hypothesis in regression. They assumed that 
the true model was an unknown nonparametric re- 
gression function. Goodness of fit was measured by 
the Euclidean distance between the unknown true 
regression function and the parametric model. 

Dette and Munk first estimated the Euclidean dis- 
tance under the null hypothesis. To obtain the dis- 
tance under the alternative, the classical concept of 
analysis of variance was generalized to the nonpara- 
metric setting. Their goodness-of-fit statistic mea- 
sure could be interpreted as the difference between 
variance estimators under the null model and the 
nonparametric model. 

The challenge one faces with all approaches that 
use distances directly, such as those described above, 
is that it is very difficult to give statistically meaningful 
interpretations to the numerical values of the 
distance. The credibility indices we have explored 
here are, in essence, reciprocals of such distances. 
However, we believe that they are easier to interpret, 
as they measure the ability of the model to describe 
samples of various sizes. They are also more univer- 
sal, having meaning across a wide range of settings. 

2.2 Davies 

In our search for a reasonable way to measure how 
well a model describes a data generating mechanism, 
we came across the work of Davies. 

Davies (1995) proposed the idea of judging model 
adequacy using the concept of data feature. The ba- 
sic idea is that if samples that are simulated from 
the model are largely indistinguishable from the real 
data, then the model should be regarded as ade- 
quate. A similar idea is expressed in Donoho (1988) 
via the following statement: "No distribution which 
produces samples very much like those actually seen 
should be ruled out a priori." 

Davies' formal theory of data features is very sim- 
ilar to hypothesis testing for goodness of fit, with 
the test statistics being designed to assess whether 
the data had the same features as a sample from the 
model. In common with testing theory (but contrary 
to us), he measures the adequacy of models from the 
null-centric convention (i.e., that the model is cor- 
rect) and does so at the de facto sample size. 

Another distinction from our approach is that rather 
than using model-based one-sample test statistics, 
he would use a nonparametric two-sample test to 
compare the data not with the model, but with sam- 
ples from the model. This has the conceptual advan- 
tage of being a direct answer to the question "Does 
this data look like a typical sample from the model?" 

The disadvantage to this approach is that it limits 
the number of testing procedures available for model 
assessment. We believe that a one-sample test is ad- 
dressing the right question, but it does have more 
power because it removes sampling uncertainty. An 
example in Section 3.3 shows that there would be a 
substantial change in magnitude of N* if we used a 
two-sample approach. 

3. CREDIBILITY INDEX 

3.1 The Formal Definition 

In constructing these indices, we have used the 
conventional test size $\alpha = 0.05$. For a given test, we 
let $N^* = N^*(\tau, \mathcal{M})$ be the value of n that gives this 
test power 0.5 at true distribution $\tau$, when the model 
is $\mathcal{M}$. Any test that is consistent for every alternative 
hypothesis (i.e., an omnibus test of fit) will give 
a finite N* under the false model assumption. 

The choice of $\alpha$ here seems like an arbitrary element, 
but we will see later that it plays a minor 
role in the comparison of N*-values. The choice of 
power 0.5 is also somewhat arbitrary, but there are 
two strong reasons behind this choice. First, there 
is the intuitive appeal of the idea that the model 
decision is 50/50 at this point and so the decision is 
"up for grabs." The index is the middle value of the 
power curve and so provides a natural one number 
summary (e.g., Figure 2). Second, this value of the 
power greatly facilitates the asymptotic analysis, as 
we will soon see. 

In an intuitive sense, the model credibility index 
$N^*(\tau, \mathcal{M})$ operates reciprocally to distance in the 
following sense. When a true distribution $\tau$ is moved 
closer to the model, so the distance is reduced, the 
sample size index should increase because a larger 
sample size n would be needed for discrimination 
between $\tau$ and $\mathcal{M}$. Typically goodness-of-fit test 
statistics are based on distance measures; in these cases 
the reciprocal connection can be made more precise, 
as we will soon see. 
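
The reciprocal relation can be previewed with a rough heuristic, which is our own sketch under simplifying assumptions and not the formal result developed later in the paper. Suppose the test rejects when $\sqrt{m}\,\hat d_m > c_\alpha$, where $\hat d_m$ is a sample-based estimate of a distance $d = d(\tau, \mathcal{M}) > 0$ whose sampling distribution is continuous and has median close to $d$ for large m. The symbols $\hat d_m$, $d$ and $c_\alpha$ are introduced here only for illustration.

```latex
% Heuristic only: \hat d_m, d and c_\alpha are illustrative symbols.
\begin{align*}
\beta_\tau(m) &= P_\tau\{\sqrt{m}\,\hat d_m > c_\alpha\}, \\
\beta_\tau(N^*) = 0.5
  &\iff \operatorname{median}_\tau(\hat d_{N^*}) = \frac{c_\alpha}{\sqrt{N^*}}, \\
\operatorname{median}_\tau(\hat d_m) \approx d
  &\implies N^* \approx \frac{c_\alpha^2}{d^2},
  \qquad \sqrt{N^*} \approx \frac{c_\alpha}{d}.
\end{align*}
```

Under this sketch only the center of the sampling distribution of $\hat d_m$ enters at power 0.5, not its spread, and halving the distance roughly quadruples the index.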

3.2 Determination of N* 

One attractive thing about the testing index N* 
is that it admits an elementary subsampling esti- 
mation. This could be carried out in a typical IID 
setting as follows. 

Given a target size $\alpha$ and a data set $x_1, \ldots, x_n$, 
suppose one would ordinarily conduct the goodness-of-fit 
test of the model based on an asymptotic critical 
value. One could then estimate N* for this test 
procedure by conducting a nonparametric bootstrap 
simulation using various sample sizes m to estimate 
the power $\beta_\tau(m)$, the goal being to find the value 
of m such that $\beta_\tau(m) = 0.5$. If we let the symbol $F$ 
represent the empirical distribution, we are treating 
$F$ as $\tau$, and calculating $N^* = N^*(F, \mathcal{M})$. Now 
assuming the model $\mathcal{M}$ does not include the empirical 
distribution $F$, the bootstrap sampling distribution 
is under the alternative, and so the rejection probability, 
which is the power of the test, should increase 
in m. The female blood pressure example in Section 1 
is an example of the bootstrap determination 
of N*. 



Table 2 
Power of the Kolmogorov-Smirnov test (two-sample method) to detect the 
difference between normal and logistic distributions at selected sample sizes 

m              Rejection proportion 
100            0.044 
500            0.116 
1000           0.169 
2000           0.361 
N* = 2650      0.513 
4000           0.768 
6000           0.907 



As we will explain later, there are good reasons to 
use sampling without replacement ("subsampling") 
instead of with replacement ("bootstrapping"). In 
subsampling the largest possible value of m is n, and 
the resulting estimated power $\hat\beta(n)$ is 1 if the test 
rejects, and 0 if the test accepts. This reflects our 
lack of knowledge (in the model false world) about 
the model's capacity to explain future samples of 
size n or larger. 
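
The difference between the two resampling schemes can be made concrete with a small sketch. It is illustrative only: `resampled_power` and its `rejects` argument (any function that returns True when the test rejects on a given sample) are hypothetical names, not part of the paper.

```python
import random


def resampled_power(data, m, rejects, n_rep=1000, replace=False, seed=0):
    """Estimate the rejection probability at sample size m by resampling.

    replace=False draws without replacement (subsampling, requires m <= n);
    replace=True draws with replacement (bootstrapping).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        if replace:
            draw = rng.choices(data, k=m)  # bootstrap resample
        else:
            draw = rng.sample(data, m)     # subsample, no repeated items
        hits += bool(rejects(draw))
    return hits / n_rep
```

With `replace=False` and m equal to the full sample size, every draw is a permutation of the data, so for a test that ignores the order of the observations the estimate is exactly 0 or 1, matching the remark above.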

To carry out a subsampling or a bootstrap determination 
of N*, one needs to define an efficient 
algorithm so as to minimize computation time. Obviously, 
sensible interpolation methods should be 
used. Moreover, it would be nice to have a good 
starting value based on asymptotic approximations. 
See Section 5.1 for more on this issue. 
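
One simple search strategy, offered as a sketch under assumptions and not as the algorithm used in the paper, is to double m until the estimated power reaches the target and then bisect. The function name and parameters are hypothetical, and the sketch assumes that `power_fn` is nondecreasing in m. With a noisy simulated power curve this holds only approximately, so in practice one would use many replications per m or smooth and interpolate the estimates.

```python
def find_credibility_index(power_fn, target=0.5, start=16, max_m=1_000_000):
    """Smallest m with power_fn(m) >= target, assuming monotone power.

    Returns None if the target is not reached by max_m, which is the
    situation in which the index cannot be estimated from the data.
    """
    lo, hi = 0, start
    # Doubling phase: bracket the crossing point.
    while power_fn(hi) < target:
        lo = hi
        hi *= 2
        if hi > max_m:
            return None
    # Bisection phase: power_fn(hi) >= target, and power_fn(lo) < target
    # whenever lo > 0.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power_fn(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

The number of power evaluations grows only logarithmically in the index, which matters because each evaluation is itself a simulation.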

3.3 One-Sample and Two-Sample Indices 

In this section we use a particular simulation model to compare different ways of computing $N^*$. We start by comparing one- and two-sample credibility indices. In this process we also learn something more about how to interpret the magnitude of a model credibility index.

Suppose we draw two samples of size m, say, one 
each from a normal and a logistic distribution, where 
the parameters are chosen to make the distributions 
as similar as possible. We could measure their simi- 
larity by using a two-sample test to see if the samples 
are detectably different. Doing this repeatedly gives 
us the power of the two-sample test between the two 
distributions. 

We did this using the two-sample Kolmogorov-Smirnov test, using 1000 samples for each $m$. Table 2 lists the proportion of rejections for various sample sizes.
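A simulation of this kind can be sketched as follows. The details are assumptions, since the text does not specify them: the logistic distribution is matched to the standard normal by mean and variance, the test uses the asymptotic 5% critical value $1.358\sqrt{2/m}$ for two samples of equal size $m$, and the function names are hypothetical.

```python
import math
import random

def ks_two_sample_stat(xs, ys):
    """Largest gap between the two empirical CDFs (assumes no ties)."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def logistic_draw(rng, scale):
    """Inverse-CDF draw from a centered logistic distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logit finite
    return scale * math.log(u / (1.0 - u))

def two_sample_power(m, reps, rng, crit=1.358):
    """Rejection proportion of the two-sample KS test, normal versus logistic."""
    scale = math.sqrt(3.0) / math.pi       # logistic with variance 1 (assumed matching rule)
    threshold = crit * math.sqrt(2.0 / m)  # asymptotic 5% critical value, equal sample sizes
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ys = [logistic_draw(rng, scale) for _ in range(m)]
        hits += ks_two_sample_stat(xs, ys) > threshold
    return hits / reps

power_100 = two_sample_power(100, 200, random.Random(2))
```

Because the matching rule and critical value are assumptions, the rejection proportions from this sketch need not reproduce Table 2 exactly.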
Suppose we let the model credibility index $N^*$ be the value of $m$ that gives power 0.50. In this example, $N^* \approx 2650$. We found it quite striking that the normal and logistic models would be so poorly discriminated on the basis of this test.

A one-sample version of this index could be created by fixing the normal density as the null hypothesis, and investigating the power of the one-sample Kolmogorov-Smirnov test using logistic samples. As seen in Table 3, this test is considerably more powerful than the two-sample one.

Note that this analysis also shows that $N^*$, when the model is normal and the true distribution is logistic, is about 485, and so logistic samples are closer to normality than is the blood pressure data set.

Finally, we use this example to compare the bias and deviation of $N^*$ when estimated by bootstrap simulation with $N^*$ when estimated by subsampling simulation. Consider a large data set of size $n$ from the logistic distribution. We let $m$ be fixed and simulate the powers of the one-sample Kolmogorov-Smirnov test for normality by bootstrapping and by subsampling. We take 500 data sets from the logistic distribution at each size $n$. The simulated average and standard deviation of power are in Table 4 for $m = 485$ and $n$ = 1000, 10,000 and 100,000.

The true power for the infinite population is ap- 
proximately 0.5. The results show that as the em- 
pirical data size n gets larger and larger, the simu- 
lated power gets closer and closer to the true value. 
Although the standard deviations are almost the 
same for the bootstrap method and the subsam- 
pling method, the simulated power by bootstrap is 
much more biased for small $n$. With the bootstrap method, sample size 485 is estimated to have 0.66 power when $n$ is 1000. That again indicates that estimation of $N^*$ by bootstrap tends to have a downward bias.

Table 3
Power of the Kolmogorov-Smirnov test (one-sample method) to detect the difference between normal and logistic distributions at selected sample sizes

m               Rejection proportion
100             0.126
400             0.435
450             0.479
N* = 485        0.500
500             0.518
1000            0.824
2500            1.000

The reader should note the large standard devia- 
tion when n = 1000. The last two columns will be 
discussed later in the context of understanding how 
well one can estimate power nonparametrically. 

4. ASYMPTOTIC ISSUES IN POWER ESTIMATION

In this section we examine the asymptotic properties, as $n \to \infty$, when one estimates the power curve $\beta_\tau(m)$ by subsampling or bootstrapping.

Suppose our test statistic is $T_n = T_n(x_1, \ldots, x_n)$, symmetric in its arguments. Suppose our test procedure is to reject $H_0$ when $\{T_n(X_1, \ldots, X_n) > c_\alpha\}$, where $c_\alpha$ is an asymptotic critical value for the test. The object of interest is

$$\beta(m) = \Pr\{T_m(X_1, \ldots, X_m) > c_\alpha\}.$$

When the null hypothesis is true (i.e., includes $\tau$), we have $\Pr\{T_n(X_1, \ldots, X_n) > c_\alpha\} \to \alpha$ as $n \to \infty$.

We will derive asymptotic results for subsampling based estimation of $\beta(m)$, with side notes on the effect of using bootstrap sampling instead. Notice that $I\{T_m(X_{s_1}, \ldots, X_{s_m}) > c_\alpha\}$, for any set of distinct integers $s_1, \ldots, s_m$, is an unbiased estimator of $\beta(m)$. Let $S = \{s_1, \ldots, s_m\}$ be a subset of $m$ distinct integers sampled from $\{1, \ldots, n\}$, and let $X_S = (X_{s_1}, \ldots, X_{s_m})$. Finally, let $K_m(X_S) = I\{T_m(X_{s_1}, \ldots, X_{s_m}) > c_\alpha\}$. We can construct a U-statistic estimator of $\beta(m)$ by

$$U_{\mathrm{comp}}(X) = \binom{n}{m}^{-1} \sum_{S \in \mathcal{S}} K_m(X_S),$$

where $\mathcal{S}$ is the set of all distinct subsets of $\{1, \ldots, n\}$ of size $m$. We can also write this as an expectation:

$$U_{\mathrm{comp}}(X) = E[K_m(X_S) \mid X_1, \ldots, X_n]. \tag{4.1}$$

Here the expectation is over samples of $m$ integers without replacement from $\{1, \ldots, n\}$, with $X = (X_1, \ldots, X_n)$ fixed.
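For a very small data set the average over all subsets in (4.1) can be evaluated exactly and compared with its Monte Carlo version based on randomly drawn subsets. The sketch below is only an illustration: the kernel (an indicator that the sum of the subset exceeds 7) is a hypothetical stand-in for $I\{T_m > c_\alpha\}$, and the function names are not from the paper.

```python
import random
from itertools import combinations

def complete_u(data, m, kernel):
    """Average of the kernel over all size-m subsets of the data."""
    values = [kernel(subset) for subset in combinations(data, m)]
    return sum(values) / len(values)

def subsampled_u(data, m, kernel, reps, rng):
    """Monte Carlo approximation: average the kernel over randomly drawn subsets."""
    return sum(kernel(rng.sample(data, m)) for _ in range(reps)) / reps

def kernel(subset):
    """Toy indicator kernel: 1 if the (hypothetical) statistic, a sum, exceeds 7."""
    return 1 if sum(subset) > 7 else 0

data = [1, 2, 3, 4, 5, 6]
exact = complete_u(data, 2, kernel)  # 6 of the 15 pairs have a sum above 7
approx = subsampled_u(data, 2, kernel, 4000, random.Random(3))
```

The exact enumeration visits $\binom{n}{m}$ subsets, which is why only the random-subset version is practical for realistic $n$ and $m$.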

We will call $U_{\mathrm{comp}}$ the complete U-statistic; in practice, we are unlikely to use it because of the $\binom{n}{m}$ calculations required. The approximation we consider will replace this exact expectation with a subsampling estimator created by randomly sampling $S$. Another possible computational shortcut would be to use a statistical design for the selection of a subset of $\mathcal{S}$ (Blom, 1976). We will focus here on the properties of $U_{\mathrm{comp}}$ itself, corresponding to an ideal infinite subsampling scheme. In this setting, we can think of the estimator obtained by bootstrap subsampling as being the corresponding V-statistic estimator of $\beta$.

Table 4
Simulated power of normality test for finite population from logistic

              Bootstrap             Subsampling           Subsampling
n             Mean     Deviation    Mean     Deviation    $\phi^{-1} = n/m$    EISS
1000          0.659    0.1753       0.493    0.2472       2.06                 4.14
10,000        0.517    0.0710       0.499    0.0735       20.6                 44.3
100,000       0.503    0.0278       0.501    0.0252       206.2                393.7

4.1 Fixed m Asymptotics

We can now make some observations about the consistency of this form of estimation. The answer depends on the asymptotic setting. If we assume that $m$ is held fixed as $n \to \infty$ (fixed $m$ asymptotics), then we can apply standard U-statistic theory and obtain consistency and asymptotic normality for the estimation of $\beta(m)$ as follows.

The exact and asymptotic variance of $U_{\mathrm{comp}}$ is described in Theorem 4.1 (Lehmann, 1999).

Theorem 4.1. If $\operatorname{Var}[K_m(x_1, \ldots, x_i, X_{i+1}, \ldots, X_m)] = \sigma_i^2$, then:

(1) The variance of the U-statistic is equal to

$$\operatorname{Var}(U_{\mathrm{comp}}) = \binom{n}{m}^{-1} \sum_{i=1}^{m} \binom{m}{i} \binom{n-m}{m-i} \sigma_i^2.$$

(2) If $\sigma_1^2 > 0$ and $\sigma_i^2 < \infty$ for all $i = 1, \ldots, m$, then

$$\operatorname{Var}(\sqrt{n}\, U_{\mathrm{comp}}) \to m^2 \sigma_1^2.$$

Theorem 4.2 gives the asymptotic normal property of $U_{\mathrm{comp}}$.

Theorem 4.2. (1) If $0 < \sigma_1^2 < \infty$, then as $n \to \infty$,

$$\sqrt{n}\,(U_{\mathrm{comp}} - \beta) \xrightarrow{d} N(0, m^2 \sigma_1^2);$$

(2) If $\sigma_i^2 < \infty$ for all $i = 1, \ldots, m$, then

$$\frac{U_{\mathrm{comp}} - \beta}{\sqrt{\operatorname{Var}(U_{\mathrm{comp}})}} \xrightarrow{d} N(0, 1).$$

Because for us $K_m$ is an indicator function, the condition that $\sigma_i^2 < \infty$ for all $i$ is obviously satisfied. When $m$ is fixed, these limiting distribution results hold for the bootstrap estimator of $\beta(m)$ because it is the corresponding V-statistic.

4.2 Fixed Sampling Ratio Asymptotics

Unfortunately, using fixed $m$ asymptotics is incredibly optimistic in our setting, as we wish to be able to estimate $\beta(m)$ for $m$ as close to $n$ as possible. The more realistic asymptotics we will use to study this case will consider sequences in $n$ in which $m = m_n$ is some fixed fraction $\phi$ of $n$, which we call fixed ratio asymptotics. In this setting the target value $\beta_\tau(m_n)$ will be changing in $n$, going to 1, and so we also need to consider local alternative sequences $\tau_n$.
To study this, we first derive some properties of $\operatorname{Var}(U_{\mathrm{comp}}(m))$. For any two independent samples $S_1$ and $S_2$ of size $m$ from $\{1, 2, \ldots, n\}$, let $O(S_1, S_2)$ be the number of common elements. We will call $O(S_1, S_2)$ the sample overlap. It has a hypergeometric distribution, so it is an elementary calculation to show that $E\{O(S_1, S_2)\}/m = m/n = \phi$. That is, the sampling fraction $\phi$ is also the mean fractional overlap between subsamples. We can then write

$$E(U_{\mathrm{comp}}^2(m)) = \sum_{k=0}^{m} E[K_m(S_1) K_m(S_2) \mid O(S_1, S_2) = k] \times \Pr[O(S_1, S_2) = k]. \tag{4.2}$$

As we will show below, the U-statistic can suffer a severe degradation in variance, relative to the fixed $m$ asymptotics, if the mean overlap $\phi$ in the indices is too large. (Note that $\phi$ goes to zero in fixed $m$ asymptotics, so the overlap mean goes to zero.) As a way to measure the overlap effect, we define an equivalent independent sample size (EISS) measure using the formula

$$\mathrm{EISS} = \frac{\operatorname{Var}(K_m(X_S))}{\operatorname{Var}(U_{\mathrm{comp}}(m))}.$$

For our indicator kernel $K_m$ this gives the formula

$$\operatorname{Var}(U_{\mathrm{comp}}(m)) = \frac{\beta(m)(1 - \beta(m))}{\mathrm{EISS}},$$

and so we can think of EISS as being the sample size we would need to conduct an IID experiment with equivalent accuracy in estimating $\beta(m)$.

From a standard U-statistic inequality (Blom, 1976, page 574), we have

$$\operatorname{Var}(U_{\mathrm{comp}}(m)) \le \frac{\operatorname{Var}(K_m)}{n/m} = \frac{\beta_\tau(m)(1 - \beta_\tau(m))}{n/m}. \tag{4.3}$$

As a consequence, we are guaranteed consistent estimation of $\beta_{\tau_n}(m_n)$, along any sequence of alternatives $\tau_n$, when $\phi = m_n/n$ goes to zero. We note that bootstrap resampling does not have this strong guarantee of consistency, as general results require $m^2/n$ to go to zero (Politis, Romano and Wolf, 1999).

This inequality also implies that $\mathrm{EISS} \ge \phi^{-1} = n/m$. That is, $\phi^{-1}$ gives us a lower bound for EISS for $\beta(m)$ inference. For example, a sampling fraction of $\phi = 1/25$ is guaranteed to provide at least as accurate an estimation of $p = \beta(m)$ as would 25 draws from a Bernoulli distribution with success probability $p$. As we will later see, this inequality can also be thought of as an approximation when $\phi^{-1}$ is small, helping to give one the proper degree of pessimism about $N^*$ inference in this case.
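As a small worked check, the bound can be compared with the last row of Table 4. The sketch assumes that, for an indicator kernel with power $\beta$, EISS is $\beta(1-\beta)$ divided by the variance of the estimated power, which is how we read the definition above; the function names are hypothetical, and the inputs are the rounded table entries, so agreement with the tabulated EISS is only approximate.

```python
def eiss(beta, var_u):
    """Equivalent independent sample size for an indicator kernel with power beta."""
    return beta * (1.0 - beta) / var_u

def eiss_lower_bound(n, m):
    """Inverse sampling fraction n/m, the lower bound implied by inequality (4.3)."""
    return n / m

# Last row of Table 4: n = 100,000, m = 485, subsampling mean 0.501, deviation 0.0252.
bound = eiss_lower_bound(100_000, 485)
value = eiss(0.501, 0.0252 ** 2)
```

Here `bound` is about 206 and `value` is about 394, so in this row the realized EISS is roughly twice the guaranteed minimum.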

4.3 Local Alternatives: A Closer Look 

To more closely examine this approximation, we consider certain local alternatives $\tau_n$ to the null hypothesis. We will assume now that the test statistic at hand admits a standard local asymptotic analysis under alternatives of the form $\tau_n = F_0 + n^{-1/2} c\, g(x)$, for fixed $g(x)$, positive $c$ and null element $F_0$. In this setting one can typically show that $\beta_{\tau_n}(n) \to \beta_{\mathrm{loc}}(c)$ as $n \to \infty$, where the local alternative power curve $\beta_{\mathrm{loc}}(c)$ is a continuous increasing function of $c$. For example, for Pearson's chi-square test, the local analysis leads to a noncentral chi-square distribution. (See Ferguson, 1996, page 63.) To find the local power along the sequence $\tau_n$ when a different sample size is used, say, $m_n = \phi n$, we can rewrite the alternative as

$$\tau_n = F_0 + m^{-1/2} \phi^{1/2} c\, g(x).$$

The sample size changes the scaling factor from $c$ to $\phi^{1/2} c$. Hence, the asymptotic power approximation for samples of size $m_n$ from $\tau_n$ is $\beta_{\mathrm{loc}}(\phi^{1/2} c)$. Assuming that $c$ is chosen so that $\beta_{\mathrm{loc}}(c) > 1/2$, there will be a fraction $\phi_{0.5}$ such that $\beta_{\mathrm{loc}}(\phi_{0.5}^{1/2} c) = 1/2$. That is, if we choose $\phi = \phi_{0.5}$, we have $\beta_{\tau_n}(m_n) \to 0.5$, for $m_n = \phi_{0.5}\, n$. As a consequence, the true $N^*$ value for the $\tau_n$ sequence grows proportionally to $n$, namely, $\phi_{0.5} \times n$.
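The rescaling argument can be made concrete with a simple stand-in for the local power curve. The sketch below assumes a one-sided z-test, for which $\beta_{\mathrm{loc}}(c) = 1 - \Phi(z_\alpha - c)$; this choice and the function names are illustrative assumptions, not part of the analysis above. Under this assumption the fraction with power one half is $\phi_{0.5} = (z_\alpha / c)^2$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def beta_loc(c, alpha=0.05):
    """Local power curve of a one-sided z-test (an assumed stand-in)."""
    return 1.0 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - c)

def phi_half(c, alpha=0.05):
    """Sampling fraction phi solving beta_loc(sqrt(phi) * c) = 1/2."""
    return (STD_NORMAL.inv_cdf(1.0 - alpha) / c) ** 2

def true_n_star(n, c, alpha=0.05):
    """Along the local sequence, the index grows like phi_half * n."""
    return phi_half(c, alpha) * n

c = 3.0
phi = phi_half(c)              # about 0.30 when c = 3 and alpha = 0.05
n_star = true_n_star(10_000, c)
```

With $c = 3$ the full-sample power is about 0.91, and roughly 30% of the sample already gives power one half, which illustrates why the index scales with $n$ in this regime.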

Since $\phi = m/n$ is fixed, our preceding result about the consistency of $U_{\mathrm{comp}}$ is not operative. In fact, in local alternative settings, the estimator is generally not consistent. However, it is possible to obtain useful understanding of how the variance changes as a function of $\phi$, and so examine its role in estimation. Returning to the formula

$$E(U_{\mathrm{comp}}^2(m)) = \sum_{k=0}^{m} E[K_m(S_1) K_m(S_2) \mid O(S_1, S_2) = k] \times \Pr[O(S_1, S_2) = k],$$

the second term on the right has the elementary calculation

$$\Pr[O(S_1, S_2) = k] = \frac{\binom{m}{k} \binom{n-m}{m-k}}{\binom{n}{m}}.$$

This hypergeometric distribution has mean $\phi m = (m/n) \cdot m$ and variance bounded above by $m\phi(1 - \phi)$, the corresponding binomial variance. Hence, $O(S_1, S_2)/m$, the fractional overlap, converges in probability to $\phi$ in our asymptotic setting.
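The mean and variance claims for the overlap distribution can be checked numerically. The sketch below uses the standard hypergeometric probabilities for the number of elements shared by two independent size-$m$ subsets of $\{1, \ldots, n\}$; the particular values $n = 1000$, $m = 100$ and the function name are arbitrary choices for illustration.

```python
from math import comb

def overlap_pmf(n, m):
    """P[O(S1, S2) = k], k = 0..m, for two independent size-m subsets of {1, ..., n}."""
    total = comb(n, m)
    return [comb(m, k) * comb(n - m, m - k) / total for k in range(m + 1)]

n, m = 1000, 100
phi = m / n
pmf = overlap_pmf(n, m)
mean = sum(k * p for k, p in enumerate(pmf))               # equals phi * m
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))  # below the binomial variance
binomial_var = m * phi * (1.0 - phi)
```

For these values the mean overlap is 10 elements, that is, a fractional overlap of $\phi = 0.1$, and the variance falls below the binomial bound of 9.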

For this reason, it is reasonable to approximate the terms

$$E[K_m(S_1) K_m(S_2) \mid O(S_1, S_2) = k_n]$$

along a sequence of $k$'s for which the samples have a fixed fractional overlap, say, $k_n = a m_n = a \phi n$, in order to approximate the important terms in the variance.

Although such a task is dependent on the structure of the test statistic, we think it is worthwhile to illustrate here how these calculations could be carried out. We consider a test statistic which is asymptotically chi-squared distributed, with degrees of freedom $d$ under the null hypothesis, and is asymptotically noncentral chi-squared, with noncentrality parameter $\delta$ under the local alternatives sequence.

If we let $G(t) = \Pr\{\chi^2_{d-1} > t\}$, then for fixed overlap fraction $a$, under standard local asymptotic calculations,

$$E[K_m(S_1) K_m(S_2) \mid O(S_1, S_2) = k_n] \to A,$$

where $A$ can be calculated as the expectation of

(4.4)   [this display is not legible in the source; the surviving fragments show a product of terms in $G$ and $c_\alpha$ with factors involving $1 - a$]

where $X$, $Y$, $Z$ are independent normal variables and $W$ is independently $\chi^2_{d-1}$.

In Table 5 we show some calculations from this formula for $d = 25$, where $\delta$ is chosen as 3.67 so as to obtain asymptotic power 0.5. The critical value is $c_{0.05} = 37.66$.

Table 5
Simulated EISS for various sampling fraction $\phi$

$\phi^{-1}$   EISS      $\phi^{-1}$   EISS      $\phi^{-1}$   EISS
2             4.2       15            52.9      50            231.6
3             7.4       20            74.6      60            294.1
4             10.7      25            97.7      75            398.0
5             14.1      30            122.0     80            435.6
10            32.6      40            174.3     100           601.7

We note several features here. First, $\phi^{-1}$ is relatively conservative, but for small values does provide the right caution. Here $\phi^{-1} = 10$ gives an EISS of 32.6, something like a bare minimum needed for $N^*$ inference. If we compare this table with the values from the simulation in Table 4, we see that in the latter, EISS was about $2 \times \phi^{-1}$ across a larger range of sampling fractions, and so did not show the steady improvement found in Table 5.

5. CREDIBILITY IN CATEGORICAL DATA MODELS

Our setting for analyzing the mathematical features of credibility indices more carefully will be likelihood ratio tests in categorical models.

5.1 Asymptotic Approximations

We derive two approximations to $N^*$ here, focusing on the likelihood ratio test in multinomial models. Here the data will be an IID sample from a multinomial distribution, as summarized by the counts $n(t)$ in the cells $t = 1, \ldots, T$. The cell proportions will be denoted $d(t) = n(t)/n$, which represent the empirical distribution $d$ of the data. The model $\mathcal{M}$ will have elements $F_\theta(t)$ representing a parametric model for the multinomial cells, for example, a log-linear model. The testing statistic will be the likelihood ratio, and we will assume that the test statistics have the standard asymptotic chi-squared distributions under the null models.

In this context we can derive a simple asymptotic 
version of the testing index and show that it is pro- 
portional to a reciprocal squared distance. This in 
turn leads to an elementary consistent estimator of 
the asymptotic index. This estimator has two im- 
portant uses: It can be used for a preliminary value 
of the index for bootstrap or subsampling testing. It 
can also itself be bootstrapped or subsampled, which 
then provides a simple way to assess the variability 
of the estimated index. 

The likelihood deviation between a multinomial distribution $p$ and a model element $F_\theta$ is defined as $L^2(p, F_\theta) = \sum_t p(t) \log(p(t)/F_\theta(t))$. This is a version of the Kullback-Leibler distance; we call it the likelihood deviation to clarify the asymmetric role of $p$ and $F$. Technically it operates as a squared distance, which is why we use the superscript 2. We also define the likelihood deviation from a multinomial distribution $p$ to the model $\mathcal{M}$ to be

$$L^2(p, \mathcal{M}) = \inf_\theta L^2(p, F_\theta). \tag{5.1}$$

For the true sample distribution $\tau$, if the infimum is attained at a particular $\theta$, it will be denoted $\theta_\tau$, and the model element that approximates $\tau$ is therefore denoted $F_{\theta_\tau}$.

In the likelihood ratio test, one rejects the null hypothesis $H_0: \tau \in \mathcal{M}$ at asymptotic size $\alpha$ if the likelihood ratio test statistic is large enough, that is,

$$2nL^2(d, \mathcal{M}) > \chi^2_{df}(\alpha),$$

where $\chi^2_{df}(\alpha)$ is the upper $1 - \alpha$ quantile of the chi-squared distribution with $df$ = the degrees of freedom. The power of the test at sample size $n$ when $d_n \sim \tau$ is

$$\Pr\{2nL^2(d_n, \mathcal{M}) > \chi^2_{df}(\alpha)\}.$$

Our goal is to determine the sample size $N^*$ at which the testing power for the alternative $\tau \notin \mathcal{M}$ is 0.5. That is,

$$\Pr\{2N^* L^2(d_{N^*}, \mathcal{M}) > \chi^2_{df}(\alpha)\} = 0.5.$$

Our first approximation to $N^*$ uses the fact that when the model is false, the centered likelihood ratio statistic has, asymptotically, a centered normal distribution. The approximation, as derived in the Appendix, is

$$N^*_{\mathrm{asy}}(\tau) = \frac{\chi^2_{df}(\alpha)}{2L^2(\tau, \mathcal{M})}. \tag{5.2}$$

Here our choice of the power 0.5 greatly simplifies the expression. Other choices for the power would depend on the limiting variance for the normal distribution.

Our second approximation is a bit more sophisticated. We consider local alternatives that approach the null as the sample size goes to infinity. This gives a noncentral chi-square approximation:

$$N^*_{\mathrm{asy2}}(\tau) = \frac{(\delta^*)^2}{X^2(\tau, \mathcal{M})}. \tag{5.3}$$

In equation (5.3), $X^2(\tau, \mathcal{M})$ is the Pearson chi-square distance,

$$X^2(\tau, \mathcal{M}) = \sum_t \frac{(\tau(t) - F_{\theta_\tau}(t))^2}{F_{\theta_\tau}(t)},$$

and $(\delta^*)^2$ is the noncentrality parameter that satisfies

$$P\{\chi^2_{df}((\delta^*)^2) > \chi^2_{df}(\alpha)\} = 0.5, \tag{5.4}$$

where $\chi^2_{df}((\delta^*)^2)$ is a noncentral chi-square distribution with degrees of freedom $df$ and noncentrality parameter $(\delta^*)^2$. One can generalize this approximation by changing the right-hand side of (5.4) to a chosen power level. See the Appendix for more details.
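Solving (5.4) requires the noncentral chi-square distribution function, which can be evaluated with elementary series. The sketch below is a minimal standard-library illustration, not the authors' computation: it assumes the index is $(\delta^*)^2 / X^2$ as read from (5.3), finds the critical value and $(\delta^*)^2$ by bisection on the interval $[0, 200]$ (adequate only for moderate degrees of freedom), and uses hypothetical function names. The example input is the Pearson distance $138.290/592$ implied by the statistic reported for Table 6.

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), by its power series."""
    if x <= 0.0:
        return 0.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total, n = term, 0
    while term > 1e-17 * total:
        n += 1
        term *= x / (a + n)
        total += term
    return min(total, 1.0)

def chi2_sf(x, df):
    """Upper tail probability of a central chi-squared variable."""
    return 1.0 - gamma_p(df / 2.0, x / 2.0)

def ncx2_sf(x, df, nc):
    """Upper tail of a noncentral chi-squared, as a Poisson mixture of central ones."""
    half = nc / 2.0
    weight, cum, total, j = math.exp(-half), 0.0, 0.0, 0
    while cum < 1.0 - 1e-12 and j < 2000:
        total += weight * chi2_sf(x, df + 2 * j)
        cum += weight
        j += 1
        weight *= half / j
    return total

def bisect(f, lo, hi, steps=60):
    """Root of an increasing function f on [lo, hi] by bisection."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def n_asy2(x2_distance, df, alpha=0.05, power=0.5):
    """Return (index, noncentrality, critical value) for the second approximation."""
    crit = bisect(lambda x: alpha - chi2_sf(x, df), 0.0, 200.0)
    nc_star = bisect(lambda nc: ncx2_sf(crit, df, nc) - power, 0.0, 200.0)
    return nc_star / x2_distance, nc_star, crit

n2, nc_star, crit = n_asy2(138.290 / 592, df=9)
```

Changing the `power` argument corresponds to replacing the right-hand side of (5.4) by another power level.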

The second approximation should be more accurate than the first for situations when $\tau$ is close to the model. Notice that both approximations (5.2) and (5.3) show an inverse relationship to squared distance. Moreover, we can see that $\alpha$ plays a role only in the numerator of the approximation. Given two models with the same testing degrees of freedom, the ratio of approximate $N^*$-values does not depend on $\alpha$.

Another useful feature of $N^*_{\mathrm{asy}}$ arises in confidence assessment. One could form asymptotic confidence intervals for $N^*(\tau)$ by bootstrapping $N^*(d)$, but this requires double bootstrapping, an expensive possibility. But bootstrapping $N^*_{\mathrm{asy}}(d)$ is relatively inexpensive and it can give a useful picture of the uncertainty involved. More rigorous methods of using subsampling to estimate standard errors are under investigation by the authors.



5.2 Numerical Examples 

We next assess model credibility for the data in Tables 6 and 7. Table 6, considered earlier by Snee (1974), is a 4 × 4 table cross-classifying eye color and hair color. The sample size $n = 592$ is somewhat large, but the table does have some small entries. The Pearson statistic for the independence model is $X^2 = 138.290$ on 9 degrees of freedom, and the likelihood ratio statistic is 146.444. The model would be rejected on the basis of these quantities.
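The two statistics quoted above can be recomputed directly from the counts in Table 6. The sketch below is a plain standard-library calculation with hypothetical function and variable names; the last two lines convert the statistics into the per-observation distances used in Section 5.1, relying on the fact that the likelihood ratio statistic equals $2nL^2(d, \mathcal{M})$ and the Pearson statistic equals $n$ times the Pearson distance.

```python
import math

# Table 6 counts; columns are Black, Brunette, Red, Blonde hair.
TABLE6 = [
    [68, 119, 26, 7],   # Brown eyes
    [20, 84, 17, 94],   # Blue eyes
    [15, 54, 14, 10],   # Hazel eyes
    [5, 29, 14, 16],    # Green eyes
]

def independence_stats(table):
    """Sample size, df, Pearson and likelihood ratio statistics for independence."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = rows[i] * cols[j] / n
            x2 += (obs - expected) ** 2 / expected
            if obs > 0:
                g2 += 2.0 * obs * math.log(obs / expected)
    df = (len(rows) - 1) * (len(cols) - 1)
    return n, df, x2, g2

n, df, x2, g2 = independence_stats(TABLE6)
likelihood_deviation = g2 / (2 * n)  # L^2(d, M)
pearson_distance = x2 / n            # X^2(d, M)
```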

We tested the independence model for the data in Table 6, where the degrees of freedom are 9. We then apply the two approximations, (5.2) and (5.3), to obtain starting values for $N^*(d)$, which are $N^*_{\mathrm{asy}}(d) = 34$ and $N^*_{\mathrm{asy2}}(d) = 37$.

We further refine the preliminary value by bootstrap. Given the target size $\alpha = 0.05$, we took various sample sizes $m$, then generated $B = 1000$ bootstrap samples from Multinomial$(m, d)$, with margins not fixed. We then conducted the size $\alpha$ likelihood ratio test, and recorded the fraction of rejections, $\#\{2mL^2(d^*_m, \mathcal{M}) > \chi^2_{df}(\alpha)\}/B$, where $d^*_m$ denotes the cell proportions of a bootstrap sample of size $m$. The estimate of $N^*(\tau)$, $N^*(d)$, would be that sample size that gives rejection fraction 50%. See Table 8 for the numbers, as well as a comparison of bootstrap and subsampling in this example.
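The bootstrap step just described can be sketched for Table 6 as follows. This is an illustration under stated assumptions rather than the authors' code: empty cells contribute zero to the likelihood ratio statistic, the degrees of freedom are held at 9 even when a resampled table has an empty row or column, the critical value 16.919 is the tabulated upper 5% point of the chi-squared distribution with 9 degrees of freedom, and all names are hypothetical.

```python
import math
import random

TABLE6 = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]
CHI2_9_05 = 16.919  # upper 5% point of chi-squared with 9 degrees of freedom

def g2_independence(counts, n_rows, n_cols):
    """Likelihood ratio statistic for independence from row-major cell counts."""
    n = sum(counts)
    rows = [sum(counts[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]
    cols = [sum(counts[j::n_cols]) for j in range(n_cols)]
    g2 = 0.0
    for idx, obs in enumerate(counts):
        if obs > 0:
            i, j = divmod(idx, n_cols)
            g2 += 2.0 * obs * math.log(obs * n / (rows[i] * cols[j]))
    return g2

def bootstrap_rejection_fraction(table, m, reps, rng, crit=CHI2_9_05):
    """Fraction of Multinomial(m, d) resamples on which the test rejects."""
    n_rows, n_cols = len(table), len(table[0])
    weights = [c for row in table for c in row]
    cells = range(len(weights))
    hits = 0
    for _ in range(reps):
        counts = [0] * len(weights)
        for cell in rng.choices(cells, weights=weights, k=m):
            counts[cell] += 1
        hits += g2_independence(counts, n_rows, n_cols) > crit
    return hits / reps

rng = random.Random(4)
frac_32 = bootstrap_rejection_fraction(TABLE6, 32, 1000, rng)
frac_100 = bootstrap_rejection_fraction(TABLE6, 100, 1000, rng)
```

Replacing `rng.choices` on cell weights with sampling without replacement from the 592 individual observations would give the subsampling variant of the same calculation.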

In this case $N^*(d) = 32$, which is very close to the first asymptotic value of 34. A 95% bootstrap interval for $N^*_{\mathrm{asy}}(\tau)$ was found to be (25, 43). Note that $\phi^{-1} = 592/32 = 18.5$, suggesting that inference about $N^*$ is reasonable.

Diaconis and Efron (1985), in addressing the same problem posed by this paper, suggested a different way of generating an assessment of this particular data set. They compared the observed $X^2$-value with those of all possible 4 × 4 tables with $n = 592$. They found that, among all 4 × 4 tables with $n = 592$ (margins not fixed), approximately 10% have $X^2$-values less than 138.29. They concluded that the given 4 × 4 table does not lie particularly close to independence.

Table 6
Cross-classification of eye color and hair color (size n = 592)

                      Hair color
Eye color     Black    Brunette    Red    Blonde
Brown         68       119         26     7
Blue          20       84          17     94
Hazel         15       54          14     10
Green         5        29          14     16

Table 7
Cross-classification of number of children by annual income (size n = 25,263)

                           Annual income
No. of children     0-1      1-2      2-3      3+
0                   2161     3577     2184     1636
1                   2755     5081     2222     1052
2                   936      1753     640      306
3                   225      419      96       38
4+                  39       98       31       14

Our second example, Table 7, originally published in Cramer (1946), is a 5 × 4 table cross-classifying number of children by annual income levels. The sample size is $n$ = 25,263, which is very large. The Pearson and likelihood ratio goodness-of-fit statistics are 568.566 and 569.420, respectively, on 12 degrees of freedom. The $\chi^2$-statistics have extremely small p-values, leading to rejection using the conventional criteria.

Diaconis and Efron (1985) used this example as well. They found that, among all 5 × 4 tables with $n$ = 25,263 (margins not fixed), the proportion of those having $X^2$ less than 568.576 is $2.1 \times 10^{-7}$. They concluded that the observed table is extremely close to independence, which is dramatically opposite to the conclusion drawn from the $\chi^2$-values.

The credibility index for Table 7 was calculated as follows. The starting estimate value of $N^*_{\mathrm{asy}}(d)$ for the data in Table 7 was 470 and its bootstrap range was (386, 548), while $N^*_{\mathrm{asy2}}(d) = 439$. We refined the estimate to $N^*(d) = 425$ using the bootstrap procedure (margins not fixed). Here the closeness of the model and sample explains why $N^*_{\mathrm{asy2}}$ worked better as a bootstrap starting value. Note that $\phi^{-1}$ = 25,263/425 = 59.4, suggesting that inference on $N^*$ is reasonable. See Table 8 for more details.

It is clear that Table 7 lies much closer to the independence model than Table 6. Using the credibility index guide, we would say that the row-column independence model is credible only for samples of size $N^* = 32$ or smaller for the population represented by Table 6. Table 7 is credible for samples that are more than ten times as large.

The magnitude of the ratio for the Efron-Diaconis statistics is on a completely different scale, being $4.8 \times 10^5$. Of course, the statistics involved are quite different in interpretation. The Efron-Diaconis statistic and our index are not asking the usual questions for contingency tables. The Efron-Diaconis statistic seems to ask "is this table surprisingly close to independence?" It is calculated by assuming that prior to data collection, every possible table of that sample size was equally likely. We ask instead, "does this table come from a population that generates samples that look independent, even for large n?"

6. DISCUSSION 

The statistical community is currently facing an 
enormous challenge (and opportunity) that arises 
from the new data generating capacity of science and 
engineering. This paper has been concerned with the 
question: "How should we reconcile our parametric 
modeling tools with the fact that in a truly large 
data set, parametric models are either clearly false 
or are too complex to be concise descriptors of the 
key data features?" We have tackled one small part 
of this problem, assessing the quality of a model's 
fit while assuming it is false. We have done so by 
modifying hypothesis testing methods so that they 
can be used from a model false perspective. 

If model credibility indices are a good idea, then 
many questions remain. For example, can we design 
the test procedures, and the corresponding N* val- 
ues, that would reassure us about the robustness of 
using a standard model-based statistical procedure? 
Is there a good way to use $N^*$ in quantifying, in an absolute sense, what it means for a model to be a surprisingly good fit to a set of data, as in saying that a data set is "highly normal"? The theoretical development of this idea might involve comparison of the credibility of the chosen model with a randomly selected model with the same number of parameters.

Another issue regards the comparison of $N^*$-values in models across differing numbers of parameters. One possibility is to create an index that adjusts for the number of parameters, such as $N^*/(\#\ \text{parameters})$. The form of such an index then could depend on how we might "expect" $N^*$ to grow when the number of parameters grows, given a sequence of arbitrary models.

Although we recognize that the ideas presented 
here are only a beginning, we hope the reader has 
found them to be stimulating. 
Table 8
Summary of sample sizes and the corresponding power for data in Tables 6 and 7

Power for Table 6                           Power for Table 7
m           Bootstrap    Subsampling        m            Bootstrap    Subsampling
34          0.676        0.568              470          0.578        0.548
N* = 32     0.505        0.497              450          0.544        0.529
31          0.512        0.484              430          0.505        0.507
30          0.481        0.474              N* = 425     0.495        0.500
29          0.480        0.467              400          0.482        0.479


APPENDIX: TWO APPROXIMATIONS TO N* 

A.1 Approximation Through Normal Distribution

We can obtain a quick-and-dirty approximation using the fact that, when the model is false, the centered likelihood ratio statistic has, asymptotically, a centered normal distribution.

Lemma A.1. If $\{n(t)\}$ are a multinomial sample of size $n$ from a fixed distribution $\tau$ not in $\mathcal{M}$, then as $n \to \infty$,

$$\sqrt{n}\,(L^2(d_n, \mathcal{M}) - L^2(\tau, \mathcal{M})) \xrightarrow{d} N(0, \sigma^2),$$

provided that the asymptotic variance $\sigma^2$ is not zero or infinity.

The lemma is just the maximum likelihood within von Mises' framework (Serfling, 1980, page 211). Freitag and Munk (2005) have a bootstrap variant, which is an interesting extension of the lemma.

Note that this lemma applies to bootstrap sampling from the empirical distribution $d(t)$ (treating it as $\tau$) whenever the data $d(t)$ is not perfectly fit by the model. Now the value of $N$ that we seek satisfies

$$\Pr\left\{\sqrt{N}\,(L^2(d_N, \mathcal{M}) - L^2(\tau, \mathcal{M})) > \frac{1}{2\sqrt{N}}\chi^2_{df}(\alpha) - \sqrt{N}\,L^2(\tau, \mathcal{M})\right\} = 0.5.$$

Since the left-hand term is asymptotically normal with mean zero, this suggests that we need to solve

$$\frac{1}{2\sqrt{N}}\chi^2_{df}(\alpha) - \sqrt{N}\,L^2(\tau, \mathcal{M}) = 0.$$

Note that this calculation is independent of the unknown $\sigma^2$ due to the choice of power 0.50. It gives us the approximation

$$N^*_{\mathrm{asy}}(\tau) = \frac{\chi^2_{df}(\alpha)}{2L^2(\tau, \mathcal{M})}. \tag{A.1}$$

Thus, the asymptotic version of $N^*$ is inversely proportional to the squared likelihood deviation.

Of course, our argument was somewhat specious: one cannot simultaneously let $N$ go to infinity and solve for finite $N$. Regardless, $N^*_{\mathrm{asy}}$ provides an elementary and useful approximation to the index $N^*$, both its theoretical value (sampling under $\tau$) and the estimator (sampling under $d$).

A.2 Second Approximation to N* Using Noncentral Chi-Square Distribution

One could construct more sophisticated asymptotic approximations of $N^*$. One method would be based on using "local alternatives"; that is, based on letting the alternatives approach the null, as $n \to \infty$, obtaining noncentral chi-square approximations.

We imagine a sequence of true alternatives with $\tau_m = (1 - m^{-1/2})F + m^{-1/2} g$, where $F$ is a model element and $g$ is some fixed alternative not depending on $m$. Therefore, the likelihood ratio test statistics $2mL^2(d_m, F) \to \chi^2_{df}(\delta^2)$ as $m \to \infty$ under $\tau_m$, where $\delta^2 = X^2(g, F)$, the Pearson chi-squared distance $\sum (g - F)^2/F$, and $\chi^2_{df}(\delta^2)$ is a noncentral chi-square distribution with degrees of freedom $df$ and noncentrality parameter $\delta^2$ (Agresti, 2002).

Therefore, one can obtain the power as a function of $m$ at a fixed $g$, based on the sequence of $\tau_m$. However, what we want is the power at a particular $\tau$, which we can approximate by inventing a different $g$ for each $m$. At the targeted $m$,

$$\tau = \tau_m = (1 - m^{-1/2})F + m^{-1/2} g_m$$

implies

$$g_m = F + m^{1/2}(\tau - F).$$

This gives the corresponding noncentrality parameter

$$\delta_m^2 = \sum \frac{(g_m - F)^2}{F} = m X^2(\tau, F).$$

We then get the power at $\tau$ for large $m$ being approximately

$$\Pr\{\chi^2_{df}(m X^2(\tau, F)) > \chi^2_{df}(\alpha)\}.$$

One can find the noncentrality parameter $(\delta^*)^2(df)$ such that

$$\Pr\{\chi^2_{df}((\delta^*)^2) > \chi^2_{df}(\alpha)\} = 0.5;$$

then $N^*$ can be approximated by

$$N^*_{\mathrm{asy2}} = \frac{(\delta^*)^2}{X^2(\tau, F)}. \tag{A.2}$$